Abstract

In this paper, we investigate the problem of optimizing multivariate performance measures and propose a novel algorithm for it. Unlike traditional machine learning methods, which optimize simple loss functions to learn a prediction function, the problem studied here is how to learn an effective hyper-predictor for a tuple of data points, so that a complex loss function corresponding to a multivariate performance measure is minimized. We propose to represent the tuple of data points by a tuple of sparse codes via a dictionary, and then apply a linear function to compare a sparse code against a given candidate class label. To learn the dictionary, the sparse codes, and the parameter of the linear function, we propose a joint optimization problem in which the reconstruction error and the sparsity of the sparse codes, together with an upper bound of the complex loss function, are minimized. The upper bound of the loss function is approximated by the sparse codes and the linear function parameter. To solve this problem, we develop an iterative algorithm based on gradient descent that learns the sparse codes and the hyper-predictor parameter alternately. Experimental results on several benchmark data sets show the advantage of the proposed method over other state-of-the-art algorithms.
1. Introduction

In traditional machine learning methods, we usually use a loss function to compare the true class label of a data point against its predicted class label. By optimizing the loss function over the whole training set, we seek an optimal prediction function, called a classifier. For example, in the support vector machine (SVM) a hinge loss function is minimized, and in logistic regression (LR) a logistic loss function is used. However, when we evaluate the performance of a class label predictor, we usually consider a tuple of data points and use a complex multivariate performance measure over that tuple, which differs significantly from the loss functions used in training. For example, we may use the area under the receiver operating characteristic curve (AUC) as a multivariate performance measure to evaluate the classification performance of an SVM. Because the SVM class label predictor is trained by minimizing per-point loss functions over the training data, it is not guaranteed to minimize the loss function corresponding to AUC. Many other multivariate performance measures are defined to compare the true class label tuple of a data point tuple against its predicted class label tuple, and they are used in different machine learning applications. Examples include the F-score, the precision-recall break-even point (PRBEP), and the Matthews correlation coefficient (MCC).


To obtain the best value of a multivariate performance measure on a given tuple of data points, the problem of multivariate performance measure optimization has recently been proposed. It is defined as the problem of learning a hyper-predictor that maps a tuple of data points to a tuple of class labels. The hyper-predictor is learned so that the multivariate performance measure used to compare the true class label tuple with the predicted class label tuple is optimized directly.


1.1. Related works 

Some methods have been proposed to solve the problem of multivariate performance measure optimization. For example,

• Joachims proposed an SVM method to optimize multivariate nonlinear performance measures, including F-score and AUC. This method takes a multivariate predictor and gives an algorithm that trains the multivariate SVM in polynomial time for large classes of potentially nonlinear performance measures. Moreover, the traditional SVM with hinge loss function can be treated as a special case of this method.

• Zhang et al. [12] proposed a smoothing strategy for multivariate performance score optimization, in particular for PRBEP and AUC. The proposed method combines Nesterov's accelerated gradient algorithm with the smoothing strategy to obtain an optimization algorithm. This algorithm converges to a solution of a given accuracy within a number of iterations that is bounded in terms of that accuracy.

• Mao and Tsang proposed a generalized sparse regularizer for multivariate performance measure optimization. Based on this regularizer, a unified framework for feature selection and general loss function optimization is developed. The resulting problem is solved by a two-layer cutting plane algorithm, and its convergence is presented. Moreover, it can also be used to optimize the multivariate measures of multiple-instance learning problems.

• Li et al. [21] proposed to learn a nonlinear classifier for the optimization of nonlinear and nonsmooth performance measures by a novel two-step approach. First, a nonlinear auxiliary classifier is trained with existing learning methods, and then it is adapted to the specific performance measure. The classifier adaptation can be reduced to a quadratic programming problem, similar to that of earlier work.


1.2. Contributions 

In this paper, we investigate the use of sparse coding for multivariate performance measure optimization. Our work is inspired by the work on multivariate performance optimization using multiple kernel learning proposed by Wang et al. That work is an original contribution of major significance because, for the first time, it proposed to map the data into another space and to learn a more effective predictor in the new space for multivariate performance measure optimization. Specifically, it uses multiple kernel learning to map the input data to a new space, and then learns a new predictor to optimize the desired multivariate performance measure. Our work follows this strategy, but uses sparse coding instead of multiple kernel learning to map the original input data to a new sparse code space, in which a new predictor is learned to optimize the multivariate performance measure. Sparse coding is an important and popular data representation method: it represents a given data point by reconstructing it with regard to a dictionary. The reconstruction coefficients are constrained to be sparse and are used as a new representation of the data point. Sparse coding has been used widely in both the machine learning and computer vision communities for pattern classification problems. For example, Mairal et al. proposed to learn the sparse codes and a classifier jointly on a training set. However, the loss function used in that method is a traditional logistic loss. In this paper, we ask the following question: how can we learn the sparse codes and the corresponding class prediction function to optimize a multivariate performance measure? To answer this question, we propose a novel multivariate performance optimization method. In this method, we learn sparse codes from the tuple of training data points, and apply a linear function to match the sparse code tuple against a candidate class label tuple. Based on the linear function, we design a hyper-predictor to predict the optimal class label tuple. The loss function of the desired multivariate performance measure is then used to compare the prediction of the hyper-predictor with the true class label tuple, and is minimized to optimize the multivariate performance measure. The contributions of this paper are twofold:


1. We propose a joint model of sparse coding and multivariate performance measure optimization. We learn both the sparse codes and the hyper-predictor to optimize the desired multivariate performance measure. The input of the hyper-prediction function is the tuple of sparse codes, and the output is a class label tuple, which is then compared to the true class label tuple by a multivariate performance measure. A joint optimization problem is constructed for this purpose. Its objective function includes the reconstruction error and the sparsity of the sparse codes, together with the multivariate loss function of the multivariate performance measure. The multivariate loss function may be very complex and may not even have a closed form, so it is difficult to optimize directly. We therefore seek its upper bound, and approximate it as a linear function of the hyper-predictor function.

2. We propose a novel iterative algorithm to solve the proposed problem. We adopt the alternate optimization strategy, and optimize the sparse codes, the dictionary, and the hyper-predictor function alternately in an iterative algorithm. Both the sparse codes and the hyper-predictor parameters are learned by gradient descent, and the dictionary is learned by the Lagrange multiplier method.

1.3. Paper organization 

This paper is organized as follows. In Section 2, we introduce the proposed multivariate performance measure optimization method. In Section 3, the proposed method is evaluated experimentally and compared to state-of-the-art multivariate performance measure optimization methods. In Section 4, the paper is concluded with future work.

2. Proposed method 

In this section, we introduce the proposed method. We first model the problem as an optimization problem, then solve it with an iterative optimization strategy, and finally develop an iterative algorithm based on the optimization results.

2.1. Problem formulation 

Suppose we have a tuple of $n$ training data points, $x = (x_1, \cdots, x_n)$, and its corresponding class label tuple is denoted as $y = (y_1, \cdots, y_n)$, where $x_i \in \mathbb{R}^d$ is the $d$-dimensional feature vector of the $i$-th training data point, and $y_i \in \{+1, -1\}$ is the binary label of the $i$-th training data point. We can use a machine learning method to predict the class label tuple, $y^* = (y^*_1, \cdots, y^*_n)$, where $y^*_i$ is the predicted class label of the $i$-th data point. A multivariate loss, $\Delta(y^*, y)$, is defined from the desired multivariate performance measure to compare a predicted class label tuple $y^*$ of a data point tuple against its true class label tuple $y$. To learn a hyper-predictor that maps a data point tuple $x$ to an optimal class label tuple $y^*$, we learn it so that the pre-defined multivariate loss $\Delta(y^*, y)$ is minimized. The proposed learning framework is shown in the flowchart in Fig. 1.

Figure 1: Flowchart of the proposed learning framework.

We propose to represent the data points by their sparse codes using sparse coding, and then use a linear hyper-predictor to predict the class label tuple. We consider the following problems in the learning procedure.

• Sparse coding of data tuple: To represent the data points in the data tuple, we propose to reconstruct each data point in the data tuple by using a dictionary,

\[ x_i \approx \sum_{j=1}^{m} s_{ij} d_j = D s_i, \quad i = 1, \cdots, n, \tag{1} \]

where $d_j \in \mathbb{R}^d$ is the $j$-th dictionary element, $D = [d_1, \cdots, d_m] \in \mathbb{R}^{d \times m}$ is the dictionary matrix with the $j$-th dictionary element as its $j$-th column, and $m$ is the number of dictionary elements. $s_{ij}$ is the coefficient of the $j$-th dictionary element for the reconstruction of the $i$-th data point, and $s_i = [s_{i1}, \cdots, s_{im}]^\top \in \mathbb{R}^m$ is the coefficient vector for the reconstruction of the $i$-th data point. We assume that only a few dictionary elements are used for each data point, so its coefficient vector should be sparse; we therefore call it the sparse code of the data point. To learn the dictionary and the sparse codes of the data tuple, we propose to minimize the reconstruction error and encourage the sparsity of the sparse codes, which gives the following optimization problem over the data tuple,

\[ \min_{D,\, s_i|_{i=1}^n} \ \sum_{i=1}^{n} \left( \|x_i - D s_i\|_2^2 + C_1 \|s_i\|_1 \right), \quad \text{s.t. } \|d_j\|_2^2 \le c, \ \forall\, j = 1, \cdots, m. \tag{2} \]

In the objective function, the first part of each term is the reconstruction error measured by the squared $\ell_2$ norm, and the second part is the sparsity measured by the $\ell_1$ norm of $s_i$. $C_1$ is a tradeoff parameter that controls the sparsity of $s_i$: the larger the value of $C_1$, the sparser the learned $s_i$. The optimal value of this parameter can be selected by linear search or cross validation.

• Learning of hyper-predictor: We apply a linear function, $f(s, y')$, to compare the tuple of sparse codes of the data tuple, $s = (s_1, \cdots, s_n)$, against a candidate class label tuple, $y' = (y'_1, \cdots, y'_n)$,

\[ f(s, y') = \sum_{i=1}^{n} y'_i\, w^\top s_i, \tag{3} \]

where $w \in \mathbb{R}^m$ is the parameter vector of the function. The candidate class label tuple $y'$ that achieves the largest response of $f(s, y')$ is then output as the optimal class label tuple,

\[ y^* = \arg\max_{y' \in \mathcal{Y}} f(s, y'), \tag{4} \]

where $\mathcal{Y} = \{+1, -1\}^n$ is the space of candidate class label tuples. To learn the parameter vector $w$ of the hyper-predictor and the sparse codes, we propose to minimize the loss function of a pre-defined multivariate performance measure, $\Delta(y^*, y)$. To reduce the complexity of the linear function, we also propose to minimize the squared $\ell_2$ norm of the parameter $w$. Thus we propose the following optimization problem to learn $w$,

\[ \min_{w,\, s_i|_{i=1}^n} \ \left\{ \frac{C_2}{2} \|w\|_2^2 + C_3\, \Delta(y^*, y) \right\}, \tag{5} \]

where $C_2$ and $C_3$ are further tradeoff parameters. $C_2$ is the weight of the model complexity penalty term, and a larger $C_2$ leads to a simpler model. $C_3$ is the weight of the loss function over the training data points, and a larger value of $C_3$ leads the model to fit the training set better. The values of $C_2$ and $C_3$ can be selected by linear search or cross validation. Direct minimization of $\Delta(y^*, y)$ is difficult, so we seek its upper bound and minimize this upper bound in place of $\Delta(y^*, y)$.
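Theorem 1 The upper bound of $\Delta(y^*, y)$ can be obtained as follows,

\[ \Delta(y^*, y) \le \frac{1}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} F(y''), \tag{6} \]

where

\[ F(y'') = \sum_{i=1}^{n} (y''_i - y_i)\, w^\top s_i + \Delta(y'', y), \tag{7} \]

and

\[ \tau_{y''} = \begin{cases} 1, & \text{if } F(y'') \ge F(y'''),\ \forall\, y''' \in \mathcal{Y}, \\ 0, & \text{otherwise.} \end{cases} \tag{8} \]

Proof According to (4), since $y^*$ achieves the maximum of $f(s, y')$, we have

\[ \begin{aligned} & f(s, y^*) \ge f(s, y) \\ \Rightarrow\ & f(s, y^*) - f(s, y) \ge 0 \\ \Rightarrow\ & f(s, y^*) - f(s, y) + \Delta(y^*, y) \ge \Delta(y^*, y). \end{aligned} \tag{9} \]

Substituting (3) into the left-hand side of (9), and using the definition of the function $F(y'')$ in (7), we have

\[ \begin{aligned} f(s, y^*) - f(s, y) + \Delta(y^*, y) &= \sum_{i=1}^{n} y^*_i\, w^\top s_i - \sum_{i=1}^{n} y_i\, w^\top s_i + \Delta(y^*, y) \\ &= \sum_{i=1}^{n} (y^*_i - y_i)\, w^\top s_i + \Delta(y^*, y) \\ &= F(y^*). \end{aligned} \tag{10} \]

Thus (9) can be rewritten as

\[ F(y^*) \ge \Delta(y^*, y). \tag{11} \]

To find an upper bound of $F(y^*)$, we scan all the candidate class label tuples $y'' \in \mathcal{Y}$ and seek the one or more candidates that achieve the maximum of $F(y'')$. Since $y^* \in \mathcal{Y}$, the maximum of $F(y'')$ is an upper bound of $F(y^*)$,

\[ \max_{y'' \in \mathcal{Y}} F(y'') \ge F(y^*). \tag{12} \]

Moreover, we define an indicator $\tau_{y''}$ for each $y''$ to indicate whether $y''$ achieves the maximum of $F(y'')$, as in (8). In this way, we can rewrite the left-hand side of (12) as the average of $F$ over its maximizers,

\[ \max_{y'' \in \mathcal{Y}} F(y'') = \frac{1}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} F(y''). \tag{13} \]

Thus we have

\[ \frac{1}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} F(y'') = \max_{y'' \in \mathcal{Y}} F(y'') \ge F(y^*) \ge \Delta(y^*, y). \tag{14} \]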

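Instead of minimizing $\Delta(y^*, y)$ directly, we minimize its upper bound in (6), and (5) becomes

\[ \min_{w,\, s_i|_{i=1}^n} \ \left\{ \frac{C_2}{2} \|w\|_2^2 + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} \left[ \sum_{i=1}^{n} (y''_i - y_i)\, w^\top s_i + \Delta(y'', y) \right] \right\}. \tag{15} \]

The overall optimization problem is obtained by combining the problems in (2) and (15),

\[ \begin{aligned} \min_{D,\, w,\, s_i|_{i=1}^n} \ & \left\{ \sum_{i=1}^{n} \left( \|x_i - D s_i\|_2^2 + C_1 \|s_i\|_1 \right) + \frac{C_2}{2} \|w\|_2^2 + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} \left[ \sum_{i=1}^{n} (y''_i - y_i)\, w^\top s_i + \Delta(y'', y) \right] \right\}, \\ \text{s.t. } & \|d_j\|_2^2 \le c, \ \forall\, j = 1, \cdots, m. \end{aligned} \tag{16} \]

Figure 2: Flowchart of the alternate optimization strategy (optimizing the sparse codes, optimizing the linear function parameter $w$, and optimizing the dictionary $D$).

In this problem, we learn the dictionary, the sparse codes, and the hyper-predictor parameter jointly.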
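The hyper-predictor in (3) and (4) and the bound of Theorem 1 can be checked numerically on a toy tuple. The following is a minimal sketch, not the authors' implementation: it enumerates all of $\mathcal{Y}$ by brute force, which is only feasible for very small $n$, and it uses a Hamming loss as an illustrative stand-in for $\Delta$ (an assumption; any multivariate loss could be plugged in). The function names and the toy values of `w` and `s` are hypothetical.

```python
from itertools import product

def score(w, s):
    """Per-point responses w^T s_i for a tuple of sparse codes s."""
    return [sum(wk * sk for wk, sk in zip(w, si)) for si in s]

def f(w, s, y):
    """Linear compatibility f(s, y') = sum_i y'_i w^T s_i, as in (3)."""
    return sum(yi * r for yi, r in zip(y, score(w, s)))

def hamming_loss(y_pred, y_true):
    """Illustrative stand-in for Delta (assumption): fraction of mismatched labels."""
    return sum(a != b for a, b in zip(y_pred, y_true)) / len(y_true)

def predict(w, s):
    """argmax over Y of f(s, y'), as in (4), by brute-force enumeration."""
    return max(product((+1, -1), repeat=len(s)), key=lambda y: f(w, s, y))

def upper_bound(w, s, y_true, delta=hamming_loss):
    """max over y'' of F(y'') = f(s, y'') - f(s, y) + Delta(y'', y), as in (7) and (12)."""
    base = f(w, s, y_true)
    return max(f(w, s, y) - base + delta(y, y_true)
               for y in product((+1, -1), repeat=len(s)))

# Toy data: three sparse codes of length two (hypothetical values).
w = [0.5, -1.0]
s = [[1.0, 0.0], [0.0, 2.0], [0.3, 0.1]]
y_true = (+1, +1, -1)

y_star = predict(w, s)
bound = upper_bound(w, s, y_true)
loss = hamming_loss(y_star, y_true)
```

Because $f$ is a sum of per-point terms, the maximizer in (4) is simply $y^*_i = \mathrm{sign}(w^\top s_i)$; the enumeration is kept here only so that the same loop also evaluates the bound over all of $\mathcal{Y}$.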

2.2. Problem optimization 

To solve the problem in (16), we use the alternate optimization strategy. In an iterative algorithm, the variables are updated in turn. When the sparse codes are optimized, the linear function parameter and the dictionary are fixed. When the linear function parameter is optimized, the sparse codes and the dictionary are fixed. When the dictionary is optimized, the sparse codes and the linear function parameter are fixed. This strategy is shown in the flowchart in Fig. 2.

2.2.1. Optimization of sparse codes 

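When we optimize the sparse codes, we fix the dictionary and the linear function parameter, and optimize the sparse codes one by one, i.e., when one sparse code $s_i$ is considered, the other sparse codes are fixed. Thus we turn the problem in (16) into the following optimization problem by considering only $s_i$ and removing the terms irrelevant to $s_i$,

\[ \min_{s_i} \ \left\{ \|x_i - D s_i\|_2^2 + C_1 \|s_i\|_1 + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} (y''_i - y_i)\, w^\top s_i \right\}. \tag{17} \]

We rewrite the sparsity term in (17), $\|s_i\|_1$, as follows,

\[ \|s_i\|_1 = \sum_{j=1}^{m} |s_{ij}| = \sum_{j=1}^{m} \frac{s_{ij}^2}{|s_{ij}|} = s_i^\top \mathrm{diag}\!\left( \frac{1}{|s_{i1}|}, \cdots, \frac{1}{|s_{im}|} \right) s_i, \tag{18} \]

where $\mathrm{diag}\!\left( \frac{1}{|s_{i1}|}, \cdots, \frac{1}{|s_{im}|} \right)$ is a diagonal matrix with $\frac{1}{|s_{ij}|}$ as its $j$-th diagonal element. To make the objective function smooth, we fix the sparse code elements in the diagonal matrix to their values from the previous iteration. Moreover, we note that $\tau_{y''}$ is also a function of $s_i$, as shown in (8). We therefore first calculate $\tau_{y''}$ using the sparse codes solved in the previous iteration, and then fix it when we consider $s_i$ in the current iteration. In this way, (17) is changed to

\[ \min_{s_i} \ \left\{ g(s_i) = \|x_i - D s_i\|_2^2 + C_1\, s_i^\top \mathrm{diag}\!\left( \frac{1}{|s_{i1}^{pre}|}, \cdots, \frac{1}{|s_{im}^{pre}|} \right) s_i + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre} (y''_i - y_i)\, w^\top s_i \right\}, \tag{19} \]

where $s_{ij}^{pre}$ is the $j$-th element of $s_i$ solved in the previous iteration, and $\tau_{y''}^{pre}$ is $\tau_{y''}$ calculated using the previously solved $s_i$ and $w$. To minimize it, we update $s_i$ by descending along the gradient of the objective $g(s_i)$,

\[ s_i^{cur} = s_i^{pre} - \eta\, \nabla g(s_i) \big|_{s_i = s_i^{pre}}, \tag{20} \]

where $s_i^{cur}$ is the sparse code updated in the current iteration, $s_i^{pre}$ is the sparse code solved in the previous iteration, $\eta$ is the descent step, and $\nabla g(s_i)$ is the gradient of $g(s_i)$, which is given by

\[ \nabla g(s_i) = -2 D^\top (x_i - D s_i) + 2 C_1\, \mathrm{diag}\!\left( \frac{1}{|s_{i1}^{pre}|}, \cdots, \frac{1}{|s_{im}^{pre}|} \right) s_i + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre} (y''_i - y_i)\, w. \tag{21} \]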
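As an illustration of the update in (20) with the gradient in (21), here is a minimal pure-Python sketch of a single descent step for one sparse code. It is a sketch under stated assumptions rather than the authors' code: the scalar `coef` is a hypothetical name for the precomputed factor $\frac{C_3}{\sum \tau^{pre}_{y''}} \sum \tau^{pre}_{y''} (y''_i - y_i)$, and the small constant `eps` is an added safeguard against division by zero when an element of the previous code is exactly zero (the text does not discuss this case).

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def g(s, x, D, w, C1, coef, s_pre, eps=1e-8):
    """Smoothed objective (19) for one sparse code, with weights fixed from s_pre."""
    r = [xi - di for xi, di in zip(x, matvec(D, s))]
    recon = sum(v * v for v in r)
    sparsity = sum(sj * sj / (abs(pj) + eps) for sj, pj in zip(s, s_pre))
    linear = coef * sum(wj * sj for wj, sj in zip(w, s))
    return recon + C1 * sparsity + linear

def grad_g(s, x, D, w, C1, coef, s_pre, eps=1e-8):
    """Gradient (21): -2 D^T (x - D s) + 2 C1 diag(1/|s_pre|) s + coef * w."""
    r = [xi - di for xi, di in zip(x, matvec(D, s))]
    Dt_r = matvec(transpose(D), r)
    return [-2.0 * dr + 2.0 * C1 * sj / (abs(pj) + eps) + coef * wj
            for dr, sj, pj, wj in zip(Dt_r, s, s_pre, w)]

def update_code(s_pre, x, D, w, C1, coef, eta):
    """One step of (20): s_cur = s_pre - eta * grad g evaluated at s_pre."""
    grad = grad_g(s_pre, x, D, w, C1, coef, s_pre)
    return [sj - eta * gj for sj, gj in zip(s_pre, grad)]

# Toy problem (hypothetical values): d = 2 features, m = 3 dictionary elements.
D = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
x = [1.0, 2.0]
w = [0.2, -0.1, 0.3]
s_pre = [0.5, 0.5, 0.5]

s_cur = update_code(s_pre, x, D, w, C1=0.1, coef=0.4, eta=0.05)
g_before = g(s_pre, x, D, w, 0.1, 0.4, s_pre)
g_after = g(s_cur, x, D, w, 0.1, 0.4, s_pre)
```

With the weights fixed, $g$ is a convex quadratic in $s_i$, so a sufficiently small step size $\eta$ decreases it; in the toy run above `g_after` is smaller than `g_before`.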
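2.2.2. Optimization of linear function parameter

By considering only $w$ in (16), fixing the sparse codes, the dictionary, and $\tau_{y''}$ to the results of the previous iteration, and removing the terms irrelevant to $w$, we turn (16) into

\[ \min_{w} \ \left\{ h(w) = \frac{C_2}{2} \|w\|_2^2 + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre} \left[ \sum_{i=1}^{n} (y''_i - y_i)\, w^\top s_i + \Delta(y'', y) \right] \right\}. \tag{22} \]

To minimize this objective function, we update $w$ by descending along the gradient of $h(w)$,

\[ w^{cur} = w^{pre} - \eta\, \nabla h(w) \big|_{w = w^{pre}}, \tag{23} \]

where $\nabla h(w)$ is the gradient of $h(w)$, which is given by

\[ \nabla h(w) = C_2 w + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}^{pre} \sum_{i=1}^{n} (y''_i - y_i)\, s_i. \tag{24} \]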
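The update in (23) with the gradient in (24) can be sketched in the same style. This is an illustrative sketch, not the authors' implementation: the list `maximizers` is a hypothetical way of storing the label tuples with $\tau^{pre}_{y''} = 1$ (so its length equals $\sum \tau^{pre}_{y''}$), and `deltas` holds the corresponding values of $\Delta(y'', y)$, which do not affect the gradient but are needed to evaluate $h(w)$.

```python
def h(w, S, y_true, maximizers, deltas, C2, C3):
    """Objective (22) with tau fixed: tau is 1 on `maximizers` and 0 elsewhere."""
    total = 0.0
    for y2, d in zip(maximizers, deltas):
        lin = sum((a - b) * sum(wk * sk for wk, sk in zip(w, si))
                  for a, b, si in zip(y2, y_true, S))
        total += lin + d
    return 0.5 * C2 * sum(wk * wk for wk in w) + C3 * total / len(maximizers)

def grad_h(w, S, y_true, maximizers, C2, C3):
    """Gradient (24): C2 w + (C3 / sum tau) sum_{y''} tau_{y''} sum_i (y''_i - y_i) s_i."""
    grad = [C2 * wk for wk in w]
    scale = C3 / len(maximizers)
    for y2 in maximizers:
        for a, b, si in zip(y2, y_true, S):
            for k, sk in enumerate(si):
                grad[k] += scale * (a - b) * sk
    return grad

def update_w(w_pre, S, y_true, maximizers, C2, C3, eta):
    """One step of (23): w_cur = w_pre - eta * grad h evaluated at w_pre."""
    grad = grad_h(w_pre, S, y_true, maximizers, C2, C3)
    return [wk - eta * gk for wk, gk in zip(w_pre, grad)]

# Toy data (hypothetical values).
S = [[1.0, 0.0], [0.0, 2.0], [0.3, 0.1]]
y_true = (+1, +1, -1)
maximizers = [(+1, -1, +1)]   # label tuples with tau = 1
deltas = [2.0 / 3.0]          # Delta(y'', y) for each maximizer
w_pre = [0.5, -1.0]

w_cur = update_w(w_pre, S, y_true, maximizers, C2=1.0, C3=1.0, eta=0.1)
```

Each step costs one pass over the stored maximizers and the $n$ sparse codes, and no quadratic program is solved.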

2.2.3. Optimization of dictionary 

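To optimize the dictionary matrix $D$, we remove the terms irrelevant to $D$ from the objective, fix the other variables, and obtain the following optimization problem,

\[ \min_{D} \ \sum_{i=1}^{n} \|x_i - D s_i\|_2^2, \quad \text{s.t. } \|d_j\|_2^2 \le c, \ \forall\, j = 1, \cdots, m. \tag{25} \]

The dual optimization problem for this problem is

\[ \begin{aligned} \max_{\alpha_j|_{j=1}^m} \min_{D} \ \mathcal{L}(D, \alpha_j|_{j=1}^m) &= \sum_{i=1}^{n} \|x_i - D s_i\|_2^2 + \sum_{j=1}^{m} \alpha_j \left( \|d_j\|_2^2 - c \right) \\ &= \sum_{i=1}^{n} \|x_i - D s_i\|_2^2 + \sum_{j=1}^{m} \alpha_j \|d_j\|_2^2 - \sum_{j=1}^{m} \alpha_j c \\ &= \sum_{i=1}^{n} \|x_i - D s_i\|_2^2 + \mathrm{Tr}\!\left( D\, \mathrm{diag}(\alpha_1, \cdots, \alpha_m)\, D^\top \right) - \sum_{j=1}^{m} \alpha_j c, \\ \text{s.t. } \alpha_j \ge 0, \ & j = 1, \cdots, m, \end{aligned} \tag{26} \]

where $\alpha_j$ is the Lagrange multiplier for the constraint $\|d_j\|_2^2 \le c$, $\mathrm{diag}(\alpha_1, \cdots, \alpha_m)$ is a diagonal matrix with $\alpha_1, \cdots, \alpha_m$ as its diagonal elements, and $\mathcal{L}(D, \alpha_j|_{j=1}^m)$ is the Lagrange function. To minimize the Lagrange function with regard to $D$, we set its gradient with regard to $D$ to zero, and we have

\[ \begin{aligned} & \nabla \mathcal{L}_D = -2 \sum_{i=1}^{n} (x_i - D s_i)\, s_i^\top + 2 D\, \mathrm{diag}(\alpha_1, \cdots, \alpha_m) = 0 \\ \Rightarrow\ & D = \left( \sum_{i=1}^{n} x_i s_i^\top \right) \left( \sum_{i=1}^{n} s_i s_i^\top + \mathrm{diag}(\alpha_1, \cdots, \alpha_m) \right)^{-1}. \end{aligned} \tag{27} \]

To solve for the Lagrange multipliers, we use the gradient ascent algorithm to obtain $\alpha_1, \cdots, \alpha_m$ in each iteration. After we obtain $\alpha_1, \cdots, \alpha_m$, we can obtain $D$ according to (27).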
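The text does not spell out the ascent step for the multipliers. One plausible form, given here as an assumption rather than as the authors' exact rule, follows from the Lagrange function in (26): with $D(\alpha)$ denoting the minimizer in (27), the partial derivative of the dual objective with respect to $\alpha_j$ is the constraint residual of the $j$-th dictionary element. A projected ascent step with a hypothetical step size $\rho > 0$ then keeps each multiplier non-negative:

```latex
\frac{\partial}{\partial \alpha_j}\, \mathcal{L}\big(D(\alpha), \alpha\big)
  = \|d_j(\alpha)\|_2^2 - c,
\qquad
\alpha_j^{cur} = \max\!\Big( 0,\ \alpha_j^{pre} + \rho \big( \|d_j^{pre}\|_2^2 - c \big) \Big),
\quad j = 1, \cdots, m.
```

Under this rule, a multiplier grows while its dictionary element violates the norm constraint and shrinks towards zero once the constraint is inactive.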



2.3. Iterative algorithm 

Based on the optimization results, we develop a novel iterative algorithm, named JSCHP. The algorithm is described in Algorithm 1. As we can see from the algorithm, the iterations are repeated $T$ times, and in each iteration the variables are updated sequentially. The flowchart of the proposed iterative algorithm is given in Fig. 3.

Figure 3: The flowchart of the iterative algorithm of JSCHP.

The novelty of this algorithm is threefold:

1. This algorithm is the first algorithm to learn the sparse codes, dictionary and hyper-predictor jointly.

2. This algorithm is the first algorithm to use the gradient descent principle to update the hyper-predictor parameters. Traditional hyper-predictor parameter learning methods for multivariate performance optimization solve a quadratic programming problem in each iteration, which is time-consuming. Our algorithm avoids the quadratic programming problem and instead uses a simple gradient descent rule to update the parameters efficiently.

3. This algorithm is also the first algorithm to solve the sparse codes using the gradient descent rule. Traditional sparse coding algorithms solve the sparse codes by optimizing an $\ell_1$ norm regularized problem directly, which is non-smooth and time-consuming. We convert the $\ell_1$ norm regularization to a weighted $\ell_2$ norm regularization, as in (18) and (19), which is smooth and convex and can therefore be handled easily by gradient descent.

Please note that the iterative algorithm requires the parameters $C_1$, $C_2$ and $C_3$ as input. $C_1$ is the weight of the sparsity term of the sparse code, $C_2$ is the weight of the model complexity term, and $C_3$ is the weight of the losses over the training set.

3. Experiments 

In this experiment, we evaluate the proposed algorithm and compare it 
against state-of-the-art multivariate performance optimization methods. 

3.1. Data sets 

In the experiment, we used the following three data sets.

• VANET misbehavior data set: The first data set is for the problem of detecting misbehaving network nodes in Vehicular Ad Hoc Networks (VANETs). To construct this data set, we used the NCTUns-5.0 simulator to conduct simulations, and collected data of 1395 nodes. These nodes belong to two different classes, honest nodes and misbehaving nodes. The number of honest nodes is 837, and the number of misbehaving nodes is 558. Given a candidate node, the problem of misbehavior detection is to predict whether it is an honest node or a misbehaving node. Thus this is a binary classification problem. To extract the features of each node, we calculate a variety of features, including the speed deviation of the node, received signal strength (RSS), number of packets delivered, number of dropped packets, etc.

• Profile injection attacks data set: The second data set is for the problem of detecting profile injection attacks in collaborative recommender systems. It is well known that collaborative recommender systems are vulnerable to profile injection attacks. An injection attack is defined as malicious users inserting fake profiles into the rating database in order to bias the system's output. To construct the data set, we randomly select 1000 genuine user profiles from the MovieLens 1M dataset as positive data points, and randomly generate 300 attacking fake user profiles as negative data points. The problem of profile injection attack detection is to classify a candidate user profile as a genuine user or a fake user. To extract features from each user profile, we first calculate its rating series based on the novelty and popularity of items, then use the empirical mode decomposition (EMD) to decompose its rating series, and finally extract Hilbert spectrum based features.

• UT-kinect 3D action data set: The third data set is for the problem of recognizing human actions from 3D body data. In this data set, there are 200 3D body data samples, and each sample is treated as a data point. These data points belong to 10 different action classes, with 20 data points per class. The 10 classes are: walk, sit down, stand up, pick up, carry, throw, push, pull, wave, and clap hands. To extract features from each data point, we calculate the histogram of its 3D joints.

3.2. Experiment setup 

To perform the experiments, we used 10-fold cross validation. Each data set was split into 10 folds randomly. Each fold was used as the test set in turn, and the remaining 9 folds were combined and used as the training set. Given a desired multivariate performance measure, we ran the proposed algorithm on the training set to learn the dictionary and the classifier parameter. Then we used the learned dictionary and classifier to classify the test data points. Finally, we compared the classification results of the test data points against the true class labels using the given multivariate performance measure.
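The splitting protocol described above can be sketched as follows. This is a minimal illustration with the Python standard library only; the function name `k_fold_indices`, the fixed random seed, and the round-robin assignment of shuffled indices to folds are assumptions, since the paper does not specify how the random split was generated.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    The n indices are shuffled once, dealt into k folds, and each fold
    serves as the test set exactly once.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold_id, fold in enumerate(folds) if fold_id != i
                 for j in fold]
        yield train, test

# Example: the UT-kinect data set has 200 data points.
splits = list(k_fold_indices(200, k=10))
```

For each `(train, test)` pair, the model would be fitted on the training indices and the chosen multivariate measure evaluated on the test indices.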

The following multivariate performance measures were used.

• F1 score: The first multivariate performance measure is the F1 score, defined as

\[ \text{F1 score} = \frac{2 \times \text{number of correctly classified positive data points}}{2 \times \text{number of correctly classified positive data points} + \text{number of wrongly classified data points}}. \tag{31} \]

• PRBEP: The second multivariate performance measure is PRBEP, the precision-recall break-even point. It is defined as the point where the precision and recall values are equal to each other. The precision-recall curve is obtained by plotting precision against recall. Precision and recall are defined as

\[ \text{precision} = \frac{\text{number of correctly classified positive data points}}{\text{number of data points classified as positive}}, \qquad \text{recall} = \frac{\text{number of correctly classified positive data points}}{\text{total number of positive test data points}}. \tag{32} \]

We can generate different pairs of precision and recall values, and plot the precisions against the corresponding recalls to obtain the precision-recall curve. The point on the curve where the precision is equal to the recall is defined as the PRBEP.

• AUC: The third multivariate performance measure is the AUC, the area under the receiver operating characteristic curve. The receiver operating characteristic curve is obtained by plotting the true positive rate against the false positive rate, which are defined as follows,

\[ \text{true positive rate} = \frac{\text{number of correctly classified positive data points}}{\text{total number of positive test data points}}, \qquad \text{false positive rate} = \frac{\text{number of wrongly classified negative data points}}{\text{total number of negative test data points}}. \tag{33} \]

By changing a threshold parameter of the classifier, we obtain different pairs of true positive rates and false positive rates. Plotting the true positive rates against the corresponding false positive rates gives the receiver operating characteristic curve.
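The three measures can be computed from labels in $\{+1, -1\}$ and real-valued classifier scores as sketched below. This is an illustrative sketch, not the evaluation code used in the experiments. Two choices are assumptions: the AUC is computed through its standard pairwise-ranking form (the fraction of positive-negative pairs ordered correctly, ties counted as one half) instead of by integrating the curve, and the PRBEP is taken as the precision when exactly as many points are labelled positive as there are true positives, at which point precision and recall coincide.

```python
def confusion(y_true, y_pred):
    """Counts of true positives, false positives, false negatives, true negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """F1 as in (31): 2 TP / (2 TP + number of wrongly classified points)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def precision_recall(y_true, y_pred):
    """Precision and recall as in (32)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def prbep(y_true, scores):
    """Precision (= recall) when the top-k scored points are labelled positive,
    with k the number of true positives."""
    k = sum(t == 1 for t in y_true)
    ranked = sorted(zip(scores, y_true), key=lambda z: -z[0])
    return sum(t == 1 for _, t in ranked[:k]) / k

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical scores), thresholded at 0.5 for the hard labels.
y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s > 0.5 else -1 for s in scores]

f1 = f1_score(y_true, y_pred)
prec, rec = precision_recall(y_true, y_pred)
be = prbep(y_true, scores)
area = auc(y_true, scores)
```

The F1 score depends on a single thresholded labelling, whereas PRBEP and AUC depend only on how the scores rank the data points.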

3.3. Experiment results 

3.3.1. Comparison to the state of the art

In this experiment, we first compared the proposed algorithm JSCHP to several state-of-the-art machine learning algorithms for multivariate performance optimization, including cutting-plane subspace pursuit (CPSP), multivariate performance measure smoothing (MPMS) [12], feature selection based multivariate performance measure optimization (FSMPM), and classifier adaptation based multivariate performance measure optimization (CAMPM) [21]. The boxplots of the different performance measures of the 10-fold cross validation over the different data sets are given in Figs. 4, 5 and 6.

Figure 4: Results of comparison to state-of-the-arts on VANET misbehavior data set. Panels: (a) F1 score, (b) PRBEP, (c) AUC; compared methods: JSCHP, CAMPM, FSMPM, MPMS, CPSP.

From these figures, we can see that the proposed algorithm JSCHP outperforms the compared algorithms in most cases. For example, in the experiments over the VANET misbehavior data set, when PRBEP is considered, only JSCHP achieves a median value higher than 0.6, while the median values of all other algorithms are lower than 0.6. Moreover, in the experiments over the UT-kinect 3D action data set, the median F1 score of JSCHP is even higher than the 75th percentile values of the other algorithms. These results are strong evidence that the proposed algorithm is more effective than the compared algorithms for the problem of optimizing multivariate performance measures. It is also interesting to see that AUC seems to be an easier multivariate performance measure to optimize than the F1 score and PRBEP: in all the experiments over the three data sets, the observed AUC values are higher than the corresponding F1 scores and PRBEP values. The results of CAMPM, FSMPM and MPMS are comparable to each other, and better than those of CPSP.
Figure 5: Results of comparison to state-of-the-arts on profile injection attacks data set.

Figure 6: Results of comparison to state-of-the-arts on UT-kinect 3D action data set.

3.3.2. Parameter sensitivity

We are also interested in the sensitivity of the proposed algorithm to the three tradeoff parameters $C_1$, $C_2$ and $C_3$. Thus we varied the tradeoff parameters $C_1$, $C_2$ and $C_3$ jointly to measure the sensitivity of the algorithm to these parameters. The average F1 scores of the proposed algorithm for combinations of different values of these parameters are given in Fig. 7.

From Fig. 7(a) and Fig. 7(b), we can see that when $C_1$ increases, the performance also improves. $C_1$ is the weight of the sparsity term of the sparse code, so the experimental results indicate that a larger sparsity penalty leads to better performance. This means that a sparse representation is important for learning a hyper-predictor to optimize multivariate performance measures. It is well known that sparse representation can benefit the learning of a good classifier under common and simple performance measures. However, it was not known whether such a sparse representation also benefits the learning of a hyper-predictor for complex multivariate performance measure optimization. Our experiments address this question: we find that the sparsity of the representation is also important for the optimization of complex multivariate performance measures, just as it is for simple performance measures. From Fig. 7(a) and Fig. 7(c), we can see that there is no clear improvement of the performance as the $C_2$ parameter changes; the performance is, however, stable across different values. This parameter is the weight of the complexity of the hyper-predictor parameter. From the results, we cannot conclude that a simpler predictor optimizes the multivariate performance measure better than a complex predictor. From Fig. 7(b) and Fig. 7(c), we can see that a larger $C_3$ can also improve the performance. This is because $C_3$ is the weight of the upper bound of the corresponding loss function. A larger $C_3$ leads to a better solution for the minimization of the loss function, and thus to a better performance measure.


3.3.3. Running time 

We are also interested in the running time of the proposed algorithm and the compared algorithms. The boxplots of the running times of the different algorithms in the 10-fold cross validation over the UT-kinect 3D action data set are given in Fig. 8. The proposed algorithm has a shorter running time than the other algorithms. A possible reason is that the other algorithms are based on the cutting-plane algorithm, in which an active set is maintained in each iteration and a quadratic programming problem is solved over this active set. Solving the quadratic programming problem is time-consuming. Moreover, to update the active set, a maximization over all possible class label tuples is needed. In our algorithm, by contrast, we only seek a maximization in the class label tuple space to approximate the upper bound; no quadratic programming problem is solved, and only a gradient descent update is performed.

Figure 7: Parameter sensitivity curves of F1 scores over UT-kinect 3D action data set.

Figure 8: Boxplots of running time of 10-fold cross validation over UT-kinect 3D action data set.

4. Conclusion and future works 

In this paper, we proposed a novel method for the problem of multivariate performance measure optimization. This method is based on the joint learning of the sparse codes of a data point tuple and a hyper-predictor that predicts the class label tuple. In this way, the sparse code learning is guided by the minimization of the multivariate loss function corresponding to the desired multivariate performance measure. Moreover, we proposed a novel upper bound approximation of the multivariate loss function. We model the learning problem as a minimization problem and solve it by developing an iterative algorithm based on gradient descent. The proposed algorithm is compared to state-of-the-art multivariate performance measure optimization algorithms, and the results show its advantage. In the future, we will consider extending the proposed framework to the structured label prediction problem, since it is similar to multivariate performance measure optimization. We will also apply the proposed algorithm to computer vision applications [33, 34].
Algorithm 1 Iterative learning algorithm of joint learning of sparse code and hyper-predictor parameter for multivariate performance measure optimization (JSCHP).

Input: A training tuple of $n$ data points $x = (x_1, \cdots, x_n)$, and its corresponding class label tuple $y = (y_1, \cdots, y_n)$;

Input: Tradeoff parameters $C_1$, $C_2$, $C_3$;

Input: Maximum iteration number $T$;

Initialize $w^0$, $s_i^0|_{i=1}^n$ and $\alpha_j^0|_{j=1}^m$;

for $t = 1, \cdots, T$ do

Update $D^t$ by the updating rule in (27) and fixing $s_i^{t-1}|_{i=1}^n$ and $\alpha_j^{t-1}|_{j=1}^m$,

\[ D^t = \left( \sum_{i=1}^{n} x_i\, {s_i^{t-1}}^\top \right) \left( \sum_{i=1}^{n} s_i^{t-1} {s_i^{t-1}}^\top + \mathrm{diag}(\alpha_1^{t-1}, \cdots, \alpha_m^{t-1}) \right)^{-1}; \tag{28} \]

for $i = 1, \cdots, n$ do

Update $s_i^t$ by the updating rule in (20) and fixing $w^{t-1}$ and $D^t$,

\[ s_i^t = s_i^{t-1} - \eta\, \nabla g(s_i) \big|_{s_i = s_i^{t-1}}; \tag{29} \]

end for

Update $w^t$ by the updating rule in (23) and fixing $s_i^t|_{i=1}^n$,

\[ w^t = w^{t-1} - \eta \left( C_2 w^{t-1} + \frac{C_3}{\sum_{y'': y'' \in \mathcal{Y}} \tau_{y''}} \sum_{y'': y'' \in \mathcal{Y}} \tau_{y''} \sum_{i=1}^{n} (y''_i - y_i)\, s_i^t \right); \tag{30} \]

Update $\alpha_j^t|_{j=1}^m$ by fixing $D^t$ and using gradient ascent;

end for

Output: The sparse codes $s_i^T|_{i=1}^n$, dictionary matrix $D^T$ and hyper-predictor parameter $w^T$.